Multi-camera 3D content creation

ABSTRACT

Techniques for generating three-dimensional content from the recordings of multiple independently operated cameras that are not constrained to fixed positions and orientations are disclosed. In some embodiments, data from a plurality of cameras configured to capture a scene is received; a relative pose of each camera with respect to the scene is determined based at least in part on a first estimate and a second estimate, wherein the first estimate is based on image data and the second estimate is based on sensor data; relative poses of cameras with respect to one or more other cameras are determined based on determined relative poses of individual cameras with respect to the scene; and a three-dimensional reconstruction of at least a portion of the scene is generated based on the received data and determined relative poses.

CROSS REFERENCE TO OTHER APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 14/472,338, now U.S. Pat. No. 10,791,319, entitled MULTI-CAMERA 3D CONTENT CREATION filed Aug. 28, 2014, which claims priority to U.S. Provisional Patent Application No. 61/871,258 entitled MULTI-CAMERA 3D CONTENT CREATION filed Aug. 28, 2013, both of which are incorporated herein by reference for all purposes.

BACKGROUND OF THE INVENTION

Various techniques exist for three-dimensional content creation. Traditional approaches for generating three-dimensional content may employ a single or a plurality of cameras as further described below.

A single camera with known pose information moving around a static object or scene may be employed to create a three-dimensional perspective based on correspondence of images from the camera at different angles or times. Such a single camera approach is limited to static or still objects and scenes.

Two or more cameras may be employed to generate static or moving three-dimensional content. In such cases, relative poses of the cameras are established by fixed and predetermined geometric constraints. For example, the cameras may be physically mounted at prescribed positions and orientations to record a scene. Correspondence between two or more perspectives of the scene is used to generate a three-dimensional representation of the scene. Such a multi-camera approach may be employed to generate three-dimensional content that includes motion.

As described, traditional techniques for three-dimensional content generation rely on a priori knowledge of camera pose and/or relative pose.

BRIEF DESCRIPTION OF THE DRAWINGS

Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.

FIG. 1 is a diagram illustrating an embodiment of a manner in which content captured by a plurality of cameras may be reconstructed in three dimensions.

FIG. 2 is a flow chart illustrating an embodiment of a process for generating a three-dimensional reconstruction of a scene.

FIG. 3 is a flow chart illustrating an embodiment of a process for facilitating multi-camera coarse alignment.

FIG. 4 is a flow chart illustrating an embodiment of a process for facilitating single camera fine alignment.

FIG. 5 is a flow chart illustrating an embodiment of a process for facilitating multi-camera fine alignment.

DETAILED DESCRIPTION

The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term 'processor' refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.

A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses numerous alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.

Techniques for synchronizing content from a plurality of independently operated cameras that have captured possibly different perspectives of the same scene to generate at least a partial three-dimensional (3D) reconstruction of the recorded scene are disclosed. As further described herein, no foreknowledge exists of the relative pose (i.e., position and orientation) of each camera with respect to the scene as well as the relative poses between cameras. Rather, relative poses are determined using the disclosed techniques. By using multiple available camera perspectives of a scene and their correspondence based on independently estimated pose data, portions of the recorded scene where more than one view is available may be reconstructed in static and/or moving 3D as applicable.

FIG. 1 is a diagram illustrating an embodiment of a manner in which content captured by a plurality of cameras may be reconstructed in three dimensions. In the given example, a scene 100 comprising a plurality of objects is simultaneously filmed from various angles and perspectives by a plurality of mobile phone cameras 102(a)-(h). The operators of the mobile phones are not constrained to fixed positions and orientations but instead are free to move while recording scene 100. Thus, the relative pose of each camera 102 with respect to scene 100 as well as the relative poses between cameras 102 are not fixed or known a priori and furthermore are time variant.

Recordings from the various participating cameras 102 are communicated to a global processing system 104 configured to synchronize the recordings to generate a three-dimensional reconstruction of the recorded scene 100. Synchronization may be performed with respect to a plurality of time slices of the recordings to determine three-dimensional motion of the scene 100. As further described below, correspondence of different camera perspectives is based on estimating relative camera pose with respect to the scene 100 for each camera 102 as well as relative poses between cameras 102 using feature tracking with respect to the recordings of cameras 102 and/or associated inertial sensor data.

In the example of FIG. 1, various data collection, processing, and/or communication functionalities associated with three-dimensional scene reconstruction are facilitated via a mobile phone application comprising a client-side application at the mobile phone associated with each camera 102 and a server-side application at global processing system 104. In other embodiments, a plurality of cameras and a global processor may be arranged in any other appropriate configuration. Moreover, the disclosed techniques may be generally employed with respect to image and/or video data for static and/or motion-based three-dimensional reconstruction.

FIG. 2 is a flow chart illustrating an embodiment of a process for generating a three-dimensional reconstruction of a scene. For example, process 200 may be employed by global processing system 104 of FIG. 1. Process 200 starts at step 202 at which a plurality of cameras configured to (simultaneously) capture a prescribed scene is identified. In some cases, it may be known that the plurality of cameras is configured to capture the prescribed scene, for example, if the cameras (and/or associated mobile phones and/or applications) have specifically been instructed (e.g., by a server-side application) to capture the prescribed scene and/or have otherwise indicated participation in capturing the scene. In some cases, it may be manually and/or automatically determined that the plurality of cameras is capturing the same scene, e.g., based on identifying that the same features and/or fiducials comprising the scene have been captured by the cameras. In various embodiments, the scene may comprise one or more static and/or moving objects.

At step 204, data from the plurality of cameras is received. For example, the data received from each camera may comprise a video recording and/or a sequence of images of the scene. A received image, for instance, may comprise a frame of a video recording at a particular moment in time. Furthermore, the data received at step 204 from each camera may include sensor data, for example, from an associated mobile phone. Such sensor data may include Global Positioning System (GPS) data as well as other inertial and/or orientation data obtained from an associated compass, accelerometer, etc. In some cases, the sensor data comprises three translational coordinates and three rotational angles representing a combined six degrees of freedom. In some embodiments, the data received at step 204 is time stamped so that data from the plurality of cameras for a given time slice may be processed together.
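
As an illustration of the kind of payload described above, the following Python sketch (hypothetical names, not part of the disclosure) pairs each frame with a time stamp and a six degree-of-freedom sensor sample so that frames from all cameras can be grouped into common time slices.

```python
# Hypothetical per-camera payload (illustrative names, not from the patent):
# a time-stamped frame plus a six degree-of-freedom sensor sample, grouped
# into time slices so that data from all cameras can be processed together.
from dataclasses import dataclass
from typing import Tuple
import numpy as np

@dataclass
class SensorSample:
    timestamp: float                          # seconds, used for time-slice grouping
    translation: Tuple[float, float, float]   # e.g. GPS-derived x, y, z in a local frame
    rotation: Tuple[float, float, float]      # roll, pitch, yaw in radians

@dataclass
class CameraPacket:
    camera_id: str
    frame: np.ndarray                         # H x W x 3 image for this instant
    sensor: SensorSample

def group_by_time_slice(packets, slice_s=1.0 / 30):
    """Bucket packets from all cameras into common time slices so that frames
    captured at (approximately) the same instant are processed together."""
    buckets = {}
    for p in packets:
        key = round(p.sensor.timestamp / slice_s)
        buckets.setdefault(key, []).append(p)
    return buckets
```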

At step 206, a common set of points in the scene that are likely to have been captured by most, if not all, of the plurality of cameras is identified. For example, video frames associated with a prescribed time slice captured by the plurality of cameras may be manually and/or automatically searched at step 206 to identify commonly captured areas of the scene and the common set of points. In some embodiments, the common set of points comprises distinct features and/or fiducials of the scene or parts thereof that may, for example, be used for registration. In some embodiments, the common set of points comprises at least four non-coplanar points.
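
A minimal sketch of how such a common point set might be identified automatically, assuming OpenCV with SIFT (the text does not prescribe a particular detector): features from a reference view are kept only if they match in every other camera's frame for the same time slice.

```python
# Sketch: keep only SIFT keypoints from a reference view that match in every
# other camera's frame for the same time slice, yielding a candidate common
# point set. `frames` holds one grayscale image per camera.
import cv2
import numpy as np

def common_feature_points(frames, ratio=0.75):
    sift = cv2.SIFT_create()
    kps, descs = zip(*(sift.detectAndCompute(f, None) for f in frames))
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    ref_kp, ref_desc = kps[0], descs[0]
    keep = np.ones(len(ref_kp), dtype=bool)
    for desc in descs[1:]:
        matched = np.zeros(len(ref_kp), dtype=bool)
        for pair in matcher.knnMatch(ref_desc, desc, k=2):
            # Lowe's ratio test to discard ambiguous matches.
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
                matched[pair[0].queryIdx] = True
        keep &= matched
    # Pixel coordinates (in the reference view) of points seen by all cameras.
    return [ref_kp[i].pt for i in np.flatnonzero(keep)]
```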

At step 208, a relative pose of each camera with respect to the common set of identified points in the scene is determined. In various embodiments, the relative pose of each camera may be determined with respect to any combination of one or more points in the common set of points. The image and/or video data from each camera may be processed using feature extraction, tracking, and/or correspondence algorithms to generate an estimate of camera pose with respect to image content, i.e., the common set of identified points. Moreover, sensor data associated with the camera may be employed to verify and/or provide a parallel estimate of individual camera pose as well as to fill in gaps in pose estimation when pose cannot be obtained visually, e.g., during instances of time when the camera field of view does not include the scene under consideration and/or the common set of identified points. Furthermore, motion estimates from sensor data may be employed to calculate the approximate view from each camera as a function of time. At step 210, relative poses of the plurality of cameras with respect to each other are determined based on the relative poses of each camera with respect to the common set of identified points.
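
The precise fusion of the image-based and sensor-based estimates is left open here; the following sketch shows one plausible, assumed scheme (not the patented method) in which the sensor estimate verifies the visual estimate, fills gaps when the scene is out of view, and otherwise contributes a weighted blend.

```python
# Assumed fusion logic (not prescribed by the text): the sensor-based pose
# verifies the visual one, fills gaps when the scene is out of view, and
# otherwise contributes a weighted blend. Poses are 6-vectors
# (tx, ty, tz, roll, pitch, yaw); angle blending here is naive -- a fuller
# treatment would interpolate rotations (e.g. quaternion slerp).
import numpy as np

def fuse_pose(visual_pose, sensor_pose, visual_weight=0.8, disagreement=5.0):
    if visual_pose is None and sensor_pose is None:
        return None
    if visual_pose is None:
        return np.asarray(sensor_pose, dtype=float)   # gap-fill from sensors only
    if sensor_pose is None:
        return np.asarray(visual_pose, dtype=float)
    visual_pose = np.asarray(visual_pose, dtype=float)
    sensor_pose = np.asarray(sensor_pose, dtype=float)
    if np.linalg.norm(visual_pose - sensor_pose) > disagreement:
        # Large disagreement: flag for re-running feature tracking / correspondence.
        print("warning: visual and sensor pose estimates disagree")
    return visual_weight * visual_pose + (1.0 - visual_weight) * sensor_pose
```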

At step 212, images or frames captured by the cameras as well as the determined relative poses of the cameras are processed to derive a three-dimensional reconstruction of the scene in the common field of view of the cameras. For example, step 212 may comprise calculating correspondence between images comprising a frame (i.e., images associated with a particular time slice) that have been captured by the plurality of cameras. Once correspondence is calculated, the images corresponding to a frame or time slice can be rectified, and depth information for tracked features can be computed. A sparse point cloud can then be used to guide calculation of a dense surface in three-dimensional space.
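
For a single camera pair, step 212 could look roughly like the following OpenCV sketch (an assumed concrete pipeline, not the disclosed implementation): shared intrinsics K and distortion d are presumed known, and (R, T) is the relative pose determined at step 210.

```python
# Assumed OpenCV pipeline for one camera pair: rectify with the relative pose
# (R, T) from step 210, compute a disparity map, and reproject to 3D points.
# K and d (shared intrinsics and distortion) are presumed known or estimated.
import cv2
import numpy as np

def reconstruct_pair(img_l, img_r, K, d, R, T):
    size = (img_l.shape[1], img_l.shape[0])
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K, d, K, d, size, R, T)
    maps_l = cv2.initUndistortRectifyMap(K, d, R1, P1, size, cv2.CV_32FC1)
    maps_r = cv2.initUndistortRectifyMap(K, d, R2, P2, size, cv2.CV_32FC1)
    rect_l = cv2.remap(img_l, *maps_l, cv2.INTER_LINEAR)
    rect_r = cv2.remap(img_r, *maps_r, cv2.INTER_LINEAR)
    sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=7)
    disparity = sgbm.compute(rect_l, rect_r).astype(np.float32) / 16.0
    points = cv2.reprojectImageTo3D(disparity, Q)
    return points[disparity > 0]     # point cloud; sparse where texture is weak
```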

In various embodiments, process 200 may be employed for three-dimensional photography and/or videography. For the latter, the steps of process 200 may be iterated for each of a sequence of time slices, e.g., to generate a three-dimensional video of a scene. Process 200 comprises a general process for three-dimensional reconstruction. FIGS. 3-5 collectively describe three specific successive processes that in some embodiments may be iterated one or more times to improve registration among the cameras. As further described below, the three processes facilitate multi-camera coarse alignment, single camera fine alignment, and multi-camera fine alignment.

FIG. 3 is a flow chart illustrating an embodiment of a process for facilitating multi-camera coarse alignment. Process 300 starts at step 302 at which it is determined that all participating cameras are viewing the same subject, such as prescribed features and/or fiducials of a scene. In some cases, it may be assumed at step 302 that all participants are viewing the same subject. Such an assumption may be made, for instance, in response to an organizer of a recording session initiating a recording event by prompting all participants to view the same subject as the organizer. At step 304, for each camera, a recording of the subject as well as associated sensor data is received. The sensor data may comprise GPS (i.e., latitude/longitude) coordinates as well as angular orientations (i.e., roll, pitch, yaw). At step 306, for each camera, an initial estimate of relative pose with respect to the subject is computed, for example, by assuming that the center of the field of view of each camera is pointing to the same place in space. At step 308, a cost function is created that is based on minimizing the total distance between the vector specifying the line of sight of each camera and the location of the subject. The cost function may be subject to constraints such as preservation of the angular orientations as well as bounding the absolute positions of an associated mobile device within the noise level of the GPS coordinates. In various embodiments, the constraints may be held as ground truth (i.e., strict constraints) or built into the cost function as soft constraints where violations introduce increases to the cost function. At step 310, the cost function is minimized to determine an updated set of coordinates for each camera, which provides an updated estimate of relative pose for each camera with respect to the recorded subject. In various embodiments, the cost function may be formulated as a linear or a non-linear optimization problem with respect to which any appropriate methods may be used to solve for the updated pose.
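
A hedged sketch of how the cost function of steps 308-310 might be set up, assuming SciPy for the solver (the text leaves the optimizer open): each camera's position may move only within a soft GPS noise bound, its measured line of sight is preserved, and the summed point-to-ray distance to the subject is minimized. The penalty weight and function names are illustrative.

```python
# Assumed formulation with SciPy's optimizer: each camera's position may drift
# from its GPS reading only within a soft noise bound, its measured line of
# sight is held fixed, and the summed point-to-ray distance to the subject is
# minimized. Names and the penalty weight are illustrative.
import numpy as np
from scipy.optimize import minimize

def point_to_ray_distance(point, origin, direction):
    """Distance from `point` to the ray from `origin` along unit vector `direction`."""
    v = point - origin
    return np.linalg.norm(v - np.dot(v, direction) * direction)

def coarse_align(gps_positions, view_directions, subject, gps_noise=5.0):
    """gps_positions: (N, 3) reported positions; view_directions: (N, 3) unit
    line-of-sight vectors from orientation sensors; subject: (3,) estimate of
    the common subject location. Returns refined camera positions."""
    gps_positions = np.asarray(gps_positions, dtype=float)
    n = len(gps_positions)

    def cost(x):
        cams = x.reshape(n, 3)
        sight = sum(point_to_ray_distance(subject, c, d)
                    for c, d in zip(cams, view_directions))
        # Soft constraint: penalize positions drifting beyond the GPS noise level.
        drift = np.linalg.norm(cams - gps_positions, axis=1)
        return sight + 10.0 * np.sum(np.maximum(drift - gps_noise, 0.0) ** 2)

    result = minimize(cost, gps_positions.ravel(), method="Nelder-Mead")
    return result.x.reshape(n, 3)
```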

Once coarse poses are computed using process 300, the position/pose of each camera relative to the subject may be refined by utilizing traditional computer vision algorithms that take advantage of small changes in perspective introduced by a user while recording a subject. Slight changes in camera position generate slightly different perspectives of the same subject, which can be used in combination with known camera parameters to estimate distance and pose relative to the subject. For each camera, small motions should be introduced either deliberately or inadvertently by an associated user so that the same features in different frames of the video sequence appear in different locations or with different relative spacings. For a subject with a feature-rich appearance (i.e., a subject with distinct markings such as edges, corners, complex surface patterns, etc.), many features detected in one frame can be identified in subsequent frames in the video. Standard methods, including scale-invariant feature transform (SIFT) methods, may be applied.

FIG. 4 is a flow chart illustrating an embodiment of a process for facilitating single camera fine alignment. In some embodiments, process 400 is iterated for each camera. Process 400 starts at step 402 at which correspondence is established for features between at least two frames within a recorded sequence captured by a prescribed camera. At step 404, the estimate of the relative pose of the camera with respect to the subject is updated. Specifically, once correspondence is established for features between at least two of the frames of a recorded sequence captured by an individual device, the relative pose between the two (or more) positions as well as the relative positions of the features identified in 3D space can be estimated simultaneously. Absolute distances and scale between multiple views are calculated either by estimating the focal lengths using known methods or by foreknowledge of the camera parameters being used to record the sequence.
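
In the spirit of process 400, a minimal two-frame refinement could use the standard essential-matrix pipeline in OpenCV (an assumed concrete choice; the text refers only to standard methods such as SIFT): matched features between two frames of the same moving camera yield a relative rotation and a translation direction, with absolute scale fixed separately from estimated focal lengths or known camera parameters.

```python
# Sketch of a standard two-frame refinement with OpenCV (an assumed concrete
# choice): SIFT matches between two frames of the same camera give an essential
# matrix, from which the relative rotation and translation direction follow.
# Absolute scale must be fixed separately, e.g. from known camera parameters.
import cv2
import numpy as np

def relative_pose_from_two_frames(frame_a, frame_b, K):
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(frame_a, None)
    kp_b, des_b = sift.detectAndCompute(frame_b, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = [pair[0] for pair in matcher.knnMatch(des_a, des_b, k=2)
            if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance]
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in good])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in good])
    E, mask = cv2.findEssentialMat(pts_a, pts_b, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts_a, pts_b, K, mask=mask)
    return R, t      # rotation and unit-norm translation between the two views
```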

Given the coarse and fine alignments achieved using processes 300 and 400, the initial estimates for pose and distance for each individual camera can be extended to create correspondence between sets of features seen in common among multiple cameras.

FIG. 5 is a flow chart illustrating an embodiment of a process for facilitating multi-camera fine alignment. Process 500 starts at step 502 at which initial pose estimates (e.g., obtained via processes 300 and 400) are employed to identify closest matching views (i.e., nearest neighbor cameras). The SIFT framework may be used to automatically identify devices that are in close proximity or that share similar perspectives. At step 504, a fine correspondence of pose is generated among all participating cameras. In some embodiments, nearest neighbor pairs are automatically connected in sequence. For example, correspondence between camera 1 and its nearest neighbor camera 2 is estimated, subsequently correspondence between camera 2 and its nearest neighbor camera 3 is estimated, and so on until all cameras have relative correspondences. Such an approach may be extended and improved by utilizing multiple nearest neighbor cameras simultaneously (e.g., camera 1 tied together with camera 2 and camera 3, camera 2 tied together with camera 3 and camera 4, etc.). For cases in which a camera cannot be related to another camera through automated correspondence, features which are known to be common between cameras may be manually tagged by a user. For a sequence comprising a large number of image frames, manual tagging may be applied to an intermittent number of frames while automated tracking maintains correspondence between manually tagged frames. At step 506, standard rectification algorithms are applied to generate a three-dimensional representation. That is, once all camera views have correspondence amongst a large number of frames within the recorded sequences, standard rectification methods may be applied between neighboring camera views, e.g., to warp the images so that the epipolar lines are parallel. Step 506 may include computing disparity maps between views or camera pairs and estimating depths of points on images. The resulting depth maps or 3D point clouds, however, may be fairly sparse. Disparity map accuracy may be increased by introducing more than two views or cameras per calculation. Dense correspondence (i.e., a filled-in 3D model) may be estimated using standard methods.
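
One way (assumed here for illustration) to realize the chained nearest-neighbor registration of step 504 is to compose the pairwise relative poses so that every camera is expressed in a single reference camera's coordinate frame:

```python
# Assumed composition logic for the chained registration of step 504: given
# pairwise relative poses, where pairwise[i] = (R, t) maps camera i's frame to
# camera i+1's frame, express every camera in camera 0's coordinate frame.
import numpy as np

def chain_poses(pairwise):
    poses = [(np.eye(3), np.zeros(3))]           # camera 0 is the reference
    R_acc, t_acc = np.eye(3), np.zeros(3)
    for R, t in pairwise:
        R_acc = R @ R_acc                        # compose rotations along the chain
        t_acc = R @ t_acc + np.ravel(t)          # and accumulate translations
        poses.append((R_acc.copy(), t_acc.copy()))
    return poses
```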

Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.

What is claimed is:
1. A system, comprising: a processor configured to: receive data from a plurality of cameras configured to capture a scene, wherein no foreknowledge exists of relative pose of each camera with respect to the scene and relative poses between cameras; for each of the plurality of cameras, determine a relative pose of a given camera with respect to the scene based at least in part on a first estimate and a second estimate, wherein the first estimate of camera pose with respect to the scene is based on image data received from the given camera and the second estimate of camera pose with respect to the scene is based on sensor data received from the given camera and wherein the second estimate is employed to verify and provide a parallel estimate to the first estimate; determine relative poses of cameras with respect to one or more other cameras comprising the plurality of cameras based on determined relative poses of individual cameras with respect to the scene; and generate a three-dimensional reconstruction of at least a portion of the scene based on the received data and determined relative poses; and a memory coupled to the processor and configured to provide instructions to the processor.
2. The system of claim 1, wherein the plurality of cameras is independently operated.
3. The system of claim 1, wherein relative time variant poses of the cameras with respect to the scene and with respect to each other are not fixed and known a priori.
4. The system of claim 1, wherein determined relative camera pose with respect to the scene is based on a common set of identified points in the scene.
5. The system of claim 1, wherein the second estimate is employed to fill gaps in pose estimation when pose cannot be determined from the first estimate.
6. The system of claim 1, wherein data is received and relative poses are determined for each of a plurality of time slices.
7. The system of claim 1, wherein the three-dimensional reconstruction of at least the portion of the scene comprises a static or a motion-based three-dimensional reconstruction.
8. The system of claim 1, wherein received data comprises image data and sensor data.
9. The system of claim 1, wherein received data is time stamped.
10. The system of claim 1, wherein sensor data comprises Global Positioning System (GPS) coordinates.
11. The system of claim 1, wherein sensor data comprises latitude and longitude coordinates and angular orientations.
12. The system of claim 1, wherein sensor data comprises three translational coordinates and three rotational angles comprising a combined six degrees of freedom.
13. The system of claim 1, wherein to generate the three-dimensional reconstruction comprises to minimize a cost function subject to one or more constraints.
14. The system of claim 1, wherein to generate the three-dimensional reconstruction comprises to establish correspondence of features between frames of a video sequence.
15. The system of claim 1, wherein to generate the three-dimensional reconstruction comprises to identify closest matching views from nearest neighbor cameras.
16. The system of claim 1, wherein data is received from a client-side application and the three-dimensional reconstruction is generated by a server-side application.
17. A method, comprising: receiving data from a plurality of cameras configured to capture a scene, wherein no foreknowledge exists of relative pose of each camera with respect to the scene and relative poses between cameras; for each of the plurality of cameras, determining a relative pose of a given camera with respect to the scene based at least in part on a first estimate and a second estimate, wherein the first estimate of camera pose with respect to the scene is based on image data received from the given camera and the second estimate of camera pose with respect to the scene is based on sensor data received from the given camera and wherein the second estimate is employed to verify and provide a parallel estimate to the first estimate; determining relative poses of cameras with respect to one or more other cameras comprising the plurality of cameras based on determined relative poses of individual cameras with respect to the scene; and generating a three-dimensional reconstruction of at least a portion of the scene based on the received data and determined relative poses.
18. A computer program product embodied in a non-transitory computer readable storage medium and comprising computer instructions for: receiving data from a plurality of cameras configured to capture a scene, wherein no foreknowledge exists of relative pose of each camera with respect to the scene and relative poses between cameras; for each of the plurality of cameras, determining a relative pose of a given camera with respect to the scene based at least in part on a first estimate and a second estimate, wherein the first estimate of camera pose with respect to the scene is based on image data received from the given camera and the second estimate of camera pose with respect to the scene is based on sensor data received from the given camera and wherein the second estimate is employed to verify and provide a parallel estimate to the first estimate; determining relative poses of cameras with respect to one or more other cameras comprising the plurality of cameras based on determined relative poses of individual cameras with respect to the scene; and generating a three-dimensional reconstruction of at least a portion of the scene based on the received data and determined relative poses.
19. The method of claim 17, wherein the plurality of cameras is independently operated.
20. The method of claim 17, wherein relative time variant poses of the cameras with respect to the scene and with respect to each other are not fixed and known a priori.
21. The method of claim 17, wherein determined relative camera pose with respect to the scene is based on a common set of identified points in the scene.
22. The method of claim 17, wherein the second estimate is employed to fill gaps in pose estimation when pose cannot be determined from the first estimate.
23. The method of claim 17, wherein data is received and relative poses are determined for each of a plurality of time slices.
24. The method of claim 17, wherein the three-dimensional reconstruction of at least the portion of the scene comprises a static or a motion-based three-dimensional reconstruction.
25. The method of claim 17, wherein received data comprises image data and sensor data.
26. The method of claim 17, wherein received data is time stamped.
27. The method of claim 17, wherein sensor data comprises Global Positioning System (GPS) coordinates.
28. The method of claim 17, wherein sensor data comprises latitude and longitude coordinates and angular orientations.
29. The method of claim 17, wherein sensor data comprises three translational coordinates and three rotational angles comprising a combined six degrees of freedom.
30. The method of claim 17, wherein generating the three-dimensional reconstruction comprises minimizing a cost function subject to one or more constraints.
31. The method of claim 17, wherein generating the three-dimensional reconstruction comprises establishing correspondence of features between frames of a video sequence.
32. The method of claim 17, wherein generating the three-dimensional reconstruction comprises identifying closest matching views from nearest neighbor cameras.
33. The method of claim 17, wherein data is received from a client-side application and the three-dimensional reconstruction is generated by a server-side application.
34. The computer program product of claim 18, wherein the plurality of cameras is independently operated.
35. The computer program product of claim 18, wherein relative time variant poses of the cameras with respect to the scene and with respect to each other are not fixed and known a priori.
36. The computer program product of claim 18, wherein determined relative camera pose with respect to the scene is based on a common set of identified points in the scene.
37. The computer program product of claim 18, wherein the second estimate is employed to fill gaps in pose estimation when pose cannot be determined from the first estimate.
38. The computer program product of claim 18, wherein data is received and relative poses are determined for each of a plurality of time slices.
39. The computer program product of claim 18, wherein the three-dimensional reconstruction of at least the portion of the scene comprises a static or a motion-based three-dimensional reconstruction.
40. The computer program product of claim 18, wherein received data comprises image data and sensor data.
41. The computer program product of claim 18, wherein received data is time stamped.
42. The computer program product of claim 18, wherein sensor data comprises Global Positioning System (GPS) coordinates.
43. The computer program product of claim 18, wherein sensor data comprises latitude and longitude coordinates and angular orientations.
44. The computer program product of claim 18, wherein sensor data comprises three translational coordinates and three rotational angles comprising a combined six degrees of freedom.
45. The computer program product of claim 18, wherein generating the three-dimensional reconstruction comprises minimizing a cost function subject to one or more constraints.
46. The computer program product of claim 18, wherein generating the three-dimensional reconstruction comprises establishing correspondence of features between frames of a video sequence.
47. The computer program product of claim 18, wherein generating the three-dimensional reconstruction comprises identifying closest matching views from nearest neighbor cameras.
48. The computer program product of claim 18, wherein data is received from a client-side application and the three-dimensional reconstruction is generated by a server-side application.